{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "provenance": [],
      "gpuType": "T4"
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    },
    "accelerator": "GPU"
  },
  "cells": [
    {
      "cell_type": "markdown",
      "source": [
        "# Integrate LLM Frameworks\n",
        "\n",
        "The release of [BERT](https://arxiv.org/abs/1810.04805) in 2018 kicked off the language model revolution. The Transformers architecture succeeded RNNs and LSTMs to become the architecture of choice. Unbelievable progress was made in a number of areas: summarization, translation, text classification, entity classification and more. 2023 tooks things to another level with the rise of large language models (LLMs). Models with billions of parameters showed an amazing ability to generate coherent dialogue.\n",
        "\n",
        "Looking ahead towards the next wave of innovation, we're due for another shift in model architecture. For example, the [Mamba paper](https://arxiv.org/abs/2312.00752) previews a possible future after Transformers.\n",
        "\n",
        "With that in mind, [txtai](https://github.com/neuml/txtai) now has the capability to easily integrate additional LLM frameworks. While local models through Hugging Face Transformers continues to be the default choice, these additional LLM frameworks broaden the number of options available.\n",
        "\n",
        "This notebook will demonstrate how txtai can integrate with [llama.cpp](https://github.com/ggerganov/llama.cpp), [LiteLLM](https://github.com/BerriAI/litellm) and custom generation methods. For custom generation, we'll show how to run inference with a `Mamba` model."
      ],
      "metadata": {
        "id": "VGeVB8M41jqW"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Install dependencies\n",
        "\n",
        "Install `txtai` and all dependencies. Since this notebook is using optional libraries, we need to install the `pipeline-llm` extras package."
      ],
      "metadata": {
        "id": "ZQrHIw351lwE"
      }
    },
    {
      "cell_type": "code",
      "execution_count": 1,
      "metadata": {
        "id": "R0AqRP7v1hdr"
      },
      "outputs": [],
      "source": [
        "%%capture\n",
        "!pip install git+https://github.com/neuml/txtai#egg=txtai[pipeline-llm]"
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "# llama.cpp\n",
        "\n",
        "First, we'll demonstrate how to load a model with llama.cpp. This framework is an extremely popular method with those who run local LLMs. It provides a number of innovations in running LLMs on CPUs, especially on Mac's.\n",
        "\n",
        "The following example shows a retrieval augmented generation (RAG) pipeline with llama.cpp. txtai automatically loads llama.cpp models when working with GGUF files."
      ],
      "metadata": {
        "id": "32xg8L1JHd3d"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from txtai.embeddings import Embeddings\n",
        "from txtai.pipeline import Extractor, LLM\n",
        "\n",
        "data = [\n",
        "  \"US tops 5 million confirmed virus cases\",\n",
        "  \"Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg\",\n",
        "  \"Beijing mobilises invasion craft along coast as Taiwan tensions escalate\",\n",
        "  \"The National Park Service warns against sacrificing slower friends in a bear attack\",\n",
        "  \"Maine man wins $1M from $25 lottery ticket\",\n",
        "  \"Make huge profits without work, earn up to $100,000 a day\"\n",
        "]\n",
        "\n",
        "# Create embeddings\n",
        "embeddings = Embeddings(content=True, autoid=\"uuid5\")\n",
        "\n",
        "# Create an index for the list of text\n",
        "embeddings.index(data)\n",
        "\n",
        "# Create LLM with llama.cpp - GGUF file is automatically downloaded\n",
        "llm = LLM(\"TheBloke/Mistral-7B-OpenOrca-GGUF/mistral-7b-openorca.Q4_K_M.gguf\", verbose=True)\n",
        "\n",
        "template = \"\"\"<|im_start|>system\n",
        "You are a friendly assistant. You answer questions from users.<|im_end|>\n",
        "<|im_start|>user\n",
        "Find the best matching text in the context for the question. The response should be the text from the context only.\n",
        "\n",
        "Question:\n",
        "{question}\n",
        "\n",
        "Context:\n",
        "{context}\n",
        "\n",
        "Text:\n",
        "<|im_end|>\n",
        "<|im_start|>assistant\n",
        "\"\"\"\n",
        "\n",
        "# Create and run extractor instance\n",
        "extractor = Extractor(embeddings, llm, output=\"reference\", separator=\"\\n\", template=template)\n",
        "result = extractor(\"Tell me about someone lucky\")\n",
        "\n",
        "print(\"ANSWER:\", result[\"answer\"])\n",
        "print(\"REFERENCE:\", embeddings.search(\"select id, text from txtai where id = :id\", parameters={\"id\": result[\"reference\"]}))\n"
      ],
      "metadata": {
        "id": "XZ7vPBIs1rGZ",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "18554fc7-1383-46dd-fd7d-87d816747fad"
      },
      "execution_count": 7,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "ANSWER: Maine man wins $1M from $25 lottery ticket\n",
            "REFERENCE: [{'id': '37e5fae7-74c2-5f1c-bf69-2962dd7470d1', 'text': 'Maine man wins $1M from $25 lottery ticket'}]\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "The code above builds an embeddings database, runs a vector search and passes those results to a LLM prompt. As expected, it prints the best answer and reference. The difference is that LLM inference is run through `llama.cpp` vs `transformers`."
      ],
      "metadata": {
        "id": "_qDllo4AIky2"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "# LiteLLM\n",
        "\n",
        "LiteLLM is an abstraction framework designed to run with API-based LLMs. At the time of writing this article, LiteLLM supports over 100+ LLMs. See the [full list of providers](https://docs.litellm.ai/docs/providers) for all the options.\n",
        "\n",
        "The following example shows a LLM call with the Hugging Face Inference API. This method automatically detects that this is a LiteLLM model string."
      ],
      "metadata": {
        "id": "83BSRNPAI6Q9"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Hugging Face Inference API\n",
        "llm = LLM(\"huggingface/roneneldan/TinyStories-1M\")\n",
        "print(llm(\"The cat and the dog.\", maxlength=5))"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "ye0DshJF4MLj",
        "outputId": "6bc74aa4-201b-4b68-9545-a1110778e0b2"
      },
      "execution_count": 14,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "They are friends.\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "Given that all these APIs require a paid account, we'll leave it to you to try other API models using your own authentication."
      ],
      "metadata": {
        "id": "bCpx6BtOJgw2"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Custom Generation\n",
        "\n",
        "Last but certainly not least, we'll demonstrate how to add a custom generation framework. For this example, we'll use the recently released [mamba-chat](https://huggingface.co/havenhq/mamba-chat) model to build a RAG pipeline. You can read more about the model in this [GitHub Repository](https://github.com/havenhq/mamba-chat)\n",
        "\n",
        "The following sections install support for Mamba models, define a Mamba Generation instance and run a Mamba-based RAG pipeline."
      ],
      "metadata": {
        "id": "43x3YPZVJ2SA"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "%%capture\n",
        "!pip install mamba-ssm\n",
        "\n",
        "# Link CUDA libraries into environment\n",
        "!export LC_ALL=\"en_US.UTF-8\"\n",
        "!export LD_LIBRARY_PATH=\"/usr/lib64-nvidia\"\n",
        "!export LIBRARY_PATH=\"/usr/local/cuda/lib64/stubs\"\n",
        "!ldconfig /usr/lib64-nvidia"
      ],
      "metadata": {
        "id": "j6GQo1_J_HpU"
      },
      "execution_count": 4,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "import torch\n",
        "\n",
        "from transformers import AutoTokenizer\n",
        "from mamba_ssm.models.mixer_seq_simple import MambaLMHeadModel\n",
        "\n",
        "from txtai.pipeline import Generation\n",
        "\n",
        "\n",
        "class MambaGeneration(Generation):\n",
        "    def __init__(self, path, template=None, **kwargs):\n",
        "        super().__init__(path, template, **kwargs)\n",
        "\n",
        "        self.tokenizer = AutoTokenizer.from_pretrained(path)\n",
        "        self.tokenizer.eos_token = \"<|endoftext|>\"\n",
        "        self.tokenizer.pad_token = self.tokenizer.eos_token\n",
        "\n",
        "        self.model = MambaLMHeadModel.from_pretrained(path, device=\"cuda\", dtype=torch.float16)\n",
        "\n",
        "    def execute(self, texts, maxlength, **kwargs):\n",
        "        results = []\n",
        "        for text in texts:\n",
        "            # Tokenize prompt\n",
        "            tokens = self.tokenizer(text, return_tensors=\"pt\").to(\"cuda\")[\"input_ids\"]\n",
        "\n",
        "            # Run inference\n",
        "            output = self.model.generate(input_ids=tokens, max_length=maxlength, eos_token_id=self.tokenizer.eos_token_id, **kwargs)\n",
        "\n",
        "            # Decode results\n",
        "            output = self.tokenizer.batch_decode(output)\n",
        "            output = output[0].split(\"<|assistant|>\\n\")[-1].replace(\"<|endoftext|>\", \"\").strip()\n",
        "            results.append(output)\n",
        "\n",
        "        return results"
      ],
      "metadata": {
        "id": "1uYgXT_q_U9U"
      },
      "execution_count": 5,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "llm = LLM(\"havenhq/mamba-chat\", method=\"__main__.MambaGeneration\")\n",
        "\n",
        "template = \"\"\"<|system|>You are a friendly assistant. You answer questions from users.</s>\n",
        "<|user|>\n",
        "Find the best matching text in the context for the question. The response should be the text from the context only.\n",
        "\n",
        "Question:\n",
        "{question}\n",
        "\n",
        "Context:\n",
        "{context}\n",
        "</s>\n",
        "<|assistant|>\n",
        "\"\"\"\n",
        "\n",
        "# Create and run extractor instance\n",
        "extractor = Extractor(embeddings, llm, output=\"reference\", separator=\"\\n\", template=template)\n",
        "result = extractor(\"Tell me something about about wildlife\")\n",
        "\n",
        "print(\"ANSWER:\", result[\"answer\"])\n",
        "print(\"REFERENCE:\", embeddings.search(\"select id, text from txtai where id = :id\", parameters={\"id\": result[\"reference\"]}))"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "4vZUYLJB_jvb",
        "outputId": "cb8f81f5-6744-4450-9132-dadb16b9096c"
      },
      "execution_count": 6,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "ANSWER: The National Park Service warns against sacrificing slower friends in a bear attack.\n",
            "REFERENCE: [{'id': '7224f159-658b-5891-b06c-9a96cfa6a54d', 'text': 'The National Park Service warns against sacrificing slower friends in a bear attack'}]\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "As expected, the best answer and reference is shown.\n",
        "\n",
        "There is much to learn and validate about Mamba but it's important to note this model is only 2.8B parameters. The Mamba architecture is one to watch moving forward!"
      ],
      "metadata": {
        "id": "i_7fgP4qLveq"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Wrapping up\n",
        "\n",
        "This notebook demonstrated how to run LLMs through txtai using alternate LLM frameworks. It's an exciting time in AI/NLP/Machine Learning. What new innovations will 2024 bring? Time will tell but txtai is ready to integrate them in!"
      ],
      "metadata": {
        "id": "oPwgCgBc2Er2"
      }
    }
  ]
}